How do we build a model that will help the marketing department to identify the potential customers who have a higher probability of purchasing the loan. How do we evaluate the performance of such a model?
To predict whether a liability customer will buy personal loans, to understand which customer attributes are most significant in driving purchases, and to identify which segment of customers to target more.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split,GridSearchCV
# To build model for prediction
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To get diferent metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
make_scorer,
roc_auc_score,
roc_curve,
precision_recall_curve
)
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# setting the precision of floating numbers to 5 decimal points
pd.set_option("display.float_format", lambda x: "%.5f" % x)
loan_df = pd.read_csv("Loan_Modelling.csv")
loan_df.head(5)
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.60000 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.50000 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.00000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.70000 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.00000 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
loan_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 5000 non-null int64 1 Age 5000 non-null int64 2 Experience 5000 non-null int64 3 Income 5000 non-null int64 4 ZIPCode 5000 non-null int64 5 Family 5000 non-null int64 6 CCAvg 5000 non-null float64 7 Education 5000 non-null int64 8 Mortgage 5000 non-null int64 9 Personal_Loan 5000 non-null int64 10 Securities_Account 5000 non-null int64 11 CD_Account 5000 non-null int64 12 Online 5000 non-null int64 13 CreditCard 5000 non-null int64 dtypes: float64(1), int64(13) memory usage: 547.0 KB
# Write your code here
print(f"There are {loan_df.shape[0]} rows and {loan_df.shape[1]} columns in the dataset.")
There are 5000 rows and 14 columns in the dataset.
loan_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5000 entries, 0 to 4999 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 5000 non-null int64 1 Age 5000 non-null int64 2 Experience 5000 non-null int64 3 Income 5000 non-null int64 4 ZIPCode 5000 non-null int64 5 Family 5000 non-null int64 6 CCAvg 5000 non-null float64 7 Education 5000 non-null int64 8 Mortgage 5000 non-null int64 9 Personal_Loan 5000 non-null int64 10 Securities_Account 5000 non-null int64 11 CD_Account 5000 non-null int64 12 Online 5000 non-null int64 13 CreditCard 5000 non-null int64 dtypes: float64(1), int64(13) memory usage: 547.0 KB
| # | Column | Non-Null Count | Datatype |
|---|---|---|---|
| 0 | ID | 5000 non-null | int64 |
| 1 | Age | 5000 non-null | int64 |
| 2 | Experience | 5000 non-null | int64 |
| 3 | Income | 5000 non-null | int64 |
| 4 | ZIPCode | 5000 non-null | int64 |
| 5 | Family | 5000 non-null | int64 |
| 6 | CCAvg | 5000 non-null | float64 |
| 7 | Education | 5000 non-null | int64 |
| 8 | Mortgage | 5000 non-null | int64 |
| 9 | Personal_Loan | 5000 non-null | int64 |
| 10 | Securities_Account | 5000 non-null | int64 |
| 11 | CD_Account | 5000 non-null | int64 |
| 12 | Online | 5000 non-null | int64 |
| 13 | CreditCard | 5000 non-null | int64 |
The 14 columns are divided into 2 different data types:
int64 (13 columns):
ID, Age, Experience, Income, ZIPCode, Family, Education, Mortgage, Personal_Loan, Securities_Account, CD_Account, Online, CreditCard
float64 (1 column):
CCAvg
Notes:
ID and ZIPCode are identifier/location variables and should be treated as categorical in nature.Personal_Loan, Securities_Account, CD_Account, Online, and CreditCard are binary categorical variables.Education may represent ordinal categories and can be handled accordingly in analysis.# Write your code here
# Preliminary assessment to check for null values
loan_df.isnull().sum()
ID 0 Age 0 Experience 0 Income 0 ZIPCode 0 Family 0 CCAvg 0 Education 0 Mortgage 0 Personal_Loan 0 Securities_Account 0 CD_Account 0 Online 0 CreditCard 0 dtype: int64
There are no null values
loan_df.nunique()
ID 5000 Age 45 Experience 47 Income 162 ZIPCode 467 Family 4 CCAvg 108 Education 3 Mortgage 347 Personal_Loan 2 Securities_Account 2 CD_Account 2 Online 2 CreditCard 2 dtype: int64
# Check for duplicated values
loan_df.duplicated().sum()
loan_df.loc[loan_df.duplicated()]
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|
There are no duplicated values
loan_df.describe(include="all").T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ID | 5000.00000 | 2500.50000 | 1443.52000 | 1.00000 | 1250.75000 | 2500.50000 | 3750.25000 | 5000.00000 |
| Age | 5000.00000 | 45.33840 | 11.46317 | 23.00000 | 35.00000 | 45.00000 | 55.00000 | 67.00000 |
| Experience | 5000.00000 | 20.10460 | 11.46795 | -3.00000 | 10.00000 | 20.00000 | 30.00000 | 43.00000 |
| Income | 5000.00000 | 73.77420 | 46.03373 | 8.00000 | 39.00000 | 64.00000 | 98.00000 | 224.00000 |
| ZIPCode | 5000.00000 | 93169.25700 | 1759.45509 | 90005.00000 | 91911.00000 | 93437.00000 | 94608.00000 | 96651.00000 |
| Family | 5000.00000 | 2.39640 | 1.14766 | 1.00000 | 1.00000 | 2.00000 | 3.00000 | 4.00000 |
| CCAvg | 5000.00000 | 1.93794 | 1.74766 | 0.00000 | 0.70000 | 1.50000 | 2.50000 | 10.00000 |
| Education | 5000.00000 | 1.88100 | 0.83987 | 1.00000 | 1.00000 | 2.00000 | 3.00000 | 3.00000 |
| Mortgage | 5000.00000 | 56.49880 | 101.71380 | 0.00000 | 0.00000 | 0.00000 | 101.00000 | 635.00000 |
| Personal_Loan | 5000.00000 | 0.09600 | 0.29462 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 |
| Securities_Account | 5000.00000 | 0.10440 | 0.30581 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 |
| CD_Account | 5000.00000 | 0.06040 | 0.23825 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 |
| Online | 5000.00000 | 0.59680 | 0.49059 | 0.00000 | 0.00000 | 1.00000 | 1.00000 | 1.00000 |
| CreditCard | 5000.00000 | 0.29400 | 0.45564 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 1.00000 |
## Age
plt.figure(figsize=(8,4))
sns.histplot(loan_df['Age'], kde=True)
plt.title('Distribution of Age')
plt.show()
plt.figure(figsize=(6,2))
sns.boxplot(x=loan_df['Age'])
plt.title('Boxplot of Age')
plt.show()
This plot shows the frequency of ages with a KDE (Kernel Density Estimate) overlay to represent the distribution curve.
Age appears to be mostly between 25 and 65 years.
The KDE curve indicates a fairly uniform distribution with mild peaks, not heavily skewed.
Median (Q2) is around 45.
Interquartile Range (IQR) spans roughly from 37 to 58.
Minimum is around 25 and maximum around 65.
There appear to be no significant outliers, as no individual points are plotted outside the whiskers.
## Experience
plt.figure(figsize=(6,4))
sns.histplot(loan_df['Experience'], kde=True)
plt.title('Distribution of Experience')
plt.show()
plt.figure(figsize=(6,2))
sns.boxplot(x=loan_df['Experience'])
plt.title('Boxplot of Experience')
plt.show()
Range: The experience values range from 0 to around 45 years.
Shape: The distribution is relatively uniform, with some fluctuations.
There’s no strong skew — the data appears somewhat evenly spread across bins.
Peaks: Slight peaks are visible around the 5, 15, 25, and 35 years mark.
DE Line (Smoothed Curve): The Kernel Density Estimate line is not perfectly smooth, suggesting a multimodal distribution (multiple local peaks).
Spread:The interquartile range (IQR) appears to lie approximately between 10 and 35 years.
Median: The median (Q2) is around 20 years.
Whiskers: The lower whisker is near 0, and the upper whisker extends to ~45.
Outliers: There are no extreme outliers in this boxplot; the whiskers cover the entire range.
## Income
plt.figure(figsize=(6,4))
sns.histplot(loan_df['Income'], kde=True)
plt.title('Distribution of Income')
plt.show()
plt.figure(figsize=(6,2))
sns.boxplot(x=loan_df['Income'])
plt.title('Boxplot of Income')
plt.show()
Skewness: The distribution is right-skewed (positively skewed) — most values are concentrated on the lower end, with a long tail extending toward higher incomes.
Peak: A high concentration of income values lies between 30 to 70 units.
Spread: While many individuals earn less than 100 units, there are values extending well beyond 200 units, but they are less frequent.
IQR (Box):The interquartile range spans approximately 40 to 110 units.
This is where the middle 50% of income values lie.
Median: The median income is around 70–75 units, indicating that half of the data lies below this value.
Whiskers:The whiskers stretch roughly from 10 to 160, showing the range of typical values.
Outliers:There are several outliers beyond 160, shown as black dots. These are unusually high incomes compared to the rest.
# Family
plt.figure(figsize=(6,4))
sns.countplot(x=loan_df['Family'])
plt.title('Countplot of Family')
plt.show()
This is a categorical variable (discrete values: 1, 2, 3, 4).
It may be used as-is in models via encoding (e.g., one-hot or ordinal encoding), or dropped if not important.
# CCAvg
plt.figure(figsize=(6,4))
sns.histplot(loan_df['CCAvg'], kde=True)
plt.title('Distribution of CCAvg')
plt.show()
plt.figure(figsize=(6,2))
sns.boxplot(x=loan_df['CCAvg'])
plt.title('Boxplot of CCAvg')
plt.show()
The distribution is right-skewed (positively skewed).
This means most people spend less on average using credit cards, with fewer individuals spending higher amounts.
A large number of observations fall between 0 to 2.5 (likely in $000s/month).
Very few individuals have average spends greater than 5, indicating a long right tail.
The interquartile range (IQR) appears to be between ~0.5 and ~2.5.
The median is slightly below 2.
## Education
plt.figure(figsize=(6,4))
sns.countplot(x=loan_df['Education'])
plt.title('Countplot of Education')
plt.show()
Categorical Distribution:The education variable has 3 categories: 1, 2, and 3. These are likely encoded values representing levels of education (e.g., High School, Graduate, Postgraduate).
Dominant Category:Category 1 has the highest number of individuals, suggesting that the largest group has the lowest education level (assuming 1 = lowest).
Other Categories:Categories 2 and 3 have comparable frequencies, but both are significantly lower than category 1.
## Mortgage
plt.figure(figsize=(6,4))
sns.histplot(loan_df['Mortgage'], kde=True)
plt.title('Distribution of Mortgage')
plt.show()
plt.figure(figsize=(6,2))
sns.boxplot(x=loan_df['Mortgage'])
plt.title('Boxplot of Mortgage')
plt.show()
Most individuals have low or zero mortgage amounts.
There's a sharp peak near 0, indicating a large proportion of the population doesn't have a mortgage.
The distribution stretches far to the right (values above 100, even exceeding 600).
The boxplot confirms many outliers, representing individuals with unusually high mortgage values.
# Personal_Loan
plt.figure(figsize=(6,4))
sns.countplot(x=loan_df['Personal_Loan'])
plt.title('Countplot of Personal_Loan')
plt.show()
plt.figure(figsize=(6,6))
loan_df['Personal_Loan'].value_counts().plot.pie(autopct='%1.1f%%', startangle=90)
plt.title('Pie Chart of Personal Loan')
plt.show()
The variable has two classes:
0 (Did not accept personal loan)
1 (Accepted personal loan)
A significant imbalance is evident:
90.4% of customers did not accept a personal loan.
Only 9.6% accepted a personal loan.
This imbalance can negatively impact classification models if not handled properly.
## Securities_Account
plt.figure(figsize=(6,4))
sns.countplot(x=loan_df['Securities_Account'])
plt.title('Countplot of Securities_Account')
plt.show()
Binary Feature:
0 = Customer does not have a securities account
1 = Customer has a securities account
Class Distribution:
The vast majority of customers (~90%) do not have a securities account.
Only a small minority (~10%) have a securities account.
Skewed Distribution:
## CD_Account
plt.figure(figsize=(6,4))
sns.countplot(x=loan_df['CD_Account'])
plt.title('Countplot of CD_Account')
plt.show()
plt.figure(figsize=(6,6))
loan_df['CD_Account'].value_counts().plot.pie(autopct='%1.1f%%', startangle=90)
plt.title('Pie Chart of CD_Account')
plt.show()
Binary Feature:
0 = Customer does not hold a Certificate of Deposit (CD) account
1 = Customer holds a CD account
Class Distribution:
94% of customers do not have a CD account.
Only 6% of customers do have a CD account.
Highly Imbalanced:
This feature is extremely skewed toward customers not holding CD accounts.
Might require special handling (e.g., feature engineering or balancing) if used in modeling.
## Online
plt.figure(figsize=(6,4))
sns.countplot(x=loan_df['Online'])
plt.title('Countplot of Online')
plt.show()
The image displays a countplot visualizing the distribution of the 'Online' feature from a loan dataset (loan_df['Online']).
X-axis: Represents the 'Online' feature, with two categories: 0 and 1. These likely represent whether a loan application was submitted offline (0) or online (1).
Bar Heights:
There are significantly more loan applications submitted online (approximately 3000) compared to those submitted offline (approximately 2000) in this dataset. Online submissions appear to be the more prevalent method for loan applications.
# CreditCard
plt.figure(figsize=(6,4))
sns.countplot(x=loan_df['CreditCard'])
plt.title('Countplot of CreditCard')
plt.show()
plt.figure(figsize=(6,6))
loan_df['CreditCard'].value_counts().plot.pie(autopct='%1.1f%%', startangle=90)
plt.title('Pie Chart of CreditCard')
plt.show()
The observation explicitly stated in the code comment is supported by both plots: "Slightly more customers without credit cards."
Specifically, the data shows that approximately twice as many customers in this loan dataset do not have a credit card compared to those who do. The countplot shows roughly 3000 vs. 1500, and the pie chart shows approximately 70.6% vs. 29.4%
colors = ['#1f77b4', '#ff7f0e']
# Write the code here
loan_df.head()
# Correlation Heatmap
plt.figure(figsize=(10,6))
sns.heatmap(loan_df.corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()
This heatmap visualizes the pairwise correlations between different numerical features in the loan_df dataset. The color intensity and the numerical values within each cell indicate the strength and direction of the correlation.
Strong Positive Correlations (Values close to +1, reddish colors):
Age and Experience (0.99): There is a very strong positive correlation between the age of the applicant and their years of experience. This is expected as older individuals tend to have more work experience.
Strong Negative Correlations (Values close to -1, bluish colors):
There are no strong negative correlations observed in this heatmap. The negative correlation values are generally weak (close to 0).
Weak Correlations (Values close to 0, lighter colors):
Most other pairs of features show weak correlations, indicating a limited linear relationship between them. For example, the correlation between 'ID' and most other features is very close to zero, as 'ID' is likely just an identifier.
loan_df.nunique()
ID 5000 Age 45 Experience 47 Income 162 ZIPCode 467 Family 4 CCAvg 108 Education 3 Mortgage 347 Personal_Loan 2 Securities_Account 2 CD_Account 2 Online 2 CreditCard 2 dtype: int64
# Code for pair plot
sns.pairplot(loan_df[["Income", "CCAvg", "Mortgage", "Age", "Experience", "ZIPCode","Personal_Loan"]],
hue='Personal_Loan', palette=colors)
plt.suptitle('Pairwise Relationships', y=1.02)
plt.show()
# Age vs Personal_Loan
plt.figure(figsize=(8,5))
sns.boxplot(x='Personal_Loan', y='Age', data=loan_df, palette=colors)
plt.title('Age vs Personal Loan')
plt.show()
# Relationship: Income vs CCAvg vs Personal Loan
plt.figure(figsize=(10,6))
sns.scatterplot(x='Income', y='CCAvg', hue='Personal_Loan', data=loan_df, palette=colors)
plt.title('Income vs CCAvg colored by Personal Loan')
plt.show()
# Mortgage vs Income vs Loan Acceptance
plt.figure(figsize=(10,6))
sns.scatterplot(x='Mortgage', y='Income', hue='Personal_Loan', data=loan_df, palette=colors)
plt.title('Mortgage vs Income colored by Personal Loan')
plt.show()
# CD_Account and Securities_Account combined impact
plt.figure(figsize=(10,6))
sns.countplot(x='CD_Account', hue='Securities_Account', data=loan_df, palette='rocket')
plt.title('CD Account vs Securities Account')
plt.show()
# CD_Account and Securities_Account combined impact
plt.figure(figsize=(8,6))
sns.countplot(x='Personal_Loan', hue='Family', data=loan_df, palette='rocket')
plt.title('Personal_Loan vs Family')
plt.show()
# Age vs. Personal_Loan
plt.figure(figsize=(8, 6))
sns.boxplot(x='Personal_Loan', y='Age', data=loan_df, palette='viridis')
plt.title('Boxplot of Age vs. Personal Loan')
plt.xlabel('Personal Loan Accepted (0: No, 1: Yes)')
plt.ylabel('Age')
plt.show()
# Age vs. Personal_Loan (Violin Plot)
plt.figure(figsize=(8, 6))
sns.violinplot(x='Personal_Loan', y='Age', data=loan_df, palette='viridis')
plt.title('Violin Plot of Age vs. Personal Loan')
plt.xlabel('Personal Loan Accepted (0: No, 1: Yes)')
plt.ylabel('Age')
plt.show()
# Experience vs. Personal_Loan
plt.figure(figsize=(8, 6))
sns.boxplot(x='Personal_Loan', y='Experience', data=loan_df, palette='viridis')
plt.title('Boxplot of Experience vs. Personal Loan')
plt.xlabel('Personal Loan Accepted (0: No, 1: Yes)')
plt.ylabel('Experience')
plt.show()
# Experience vs. Personal_Loan (Violin Plot)
plt.figure(figsize=(8, 6))
sns.violinplot(x='Personal_Loan', y='Experience', data=loan_df, palette='viridis')
plt.title('Violin Plot of Experience vs. Personal Loan')
plt.xlabel('Personal Loan Accepted (0: No, 1: Yes)')
plt.ylabel('Experience')
plt.show()
# Income vs. Personal_Loan
plt.figure(figsize=(8, 6))
sns.boxplot(x='Personal_Loan', y='Income', data=loan_df, palette='viridis')
plt.title('Boxplot of Income vs. Personal Loan')
plt.xlabel('Personal Loan Accepted (0: No, 1: Yes)')
plt.ylabel('Income')
plt.show()
# Income vs. Personal_Loan (Violin Plot)
plt.figure(figsize=(8, 6))
sns.violinplot(x='Personal_Loan', y='Income', data=loan_df, palette='viridis')
plt.title('Violin Plot of Income vs. Personal Loan')
plt.xlabel('Personal Loan Accepted (0: No, 1: Yes)')
plt.ylabel('Income')
plt.show()
# CCAvg vs. Personal_Loan
plt.figure(figsize=(8, 6))
sns.boxplot(x='Personal_Loan', y='CCAvg', data=loan_df, palette='viridis')
plt.title('Boxplot of CCAvg vs. Personal Loan')
plt.xlabel('Personal Loan Accepted (0: No, 1: Yes)')
plt.ylabel('CCAvg')
plt.show()
# CCAvg vs. Personal_Loan (Violin Plot)
plt.figure(figsize=(8, 6))
sns.violinplot(x='Personal_Loan', y='CCAvg', data=loan_df, palette='viridis')
plt.title('Violin Plot of CCAvg vs. Personal Loan')
plt.xlabel('Personal Loan Accepted (0: No, 1: Yes)')
plt.ylabel('CCAvg')
plt.show()
# Mortgage vs. Personal_Loan
plt.figure(figsize=(8, 6))
sns.boxplot(x='Personal_Loan', y='Mortgage', data=loan_df, palette='viridis')
plt.title('Boxplot of Mortgage vs. Personal Loan')
plt.xlabel('Personal Loan Accepted (0: No, 1: Yes)')
plt.ylabel('Mortgage')
plt.show()
# Mortgage vs. Personal_Loan (Violin Plot)
plt.figure(figsize=(8, 6))
sns.violinplot(x='Personal_Loan', y='Mortgage', data=loan_df, palette='viridis')
plt.title('Violin Plot of Mortgage vs. Personal Loan')
plt.xlabel('Personal Loan Accepted (0: No, 1: Yes)')
plt.ylabel('Mortgage')
plt.show()
# Family vs. Personal_Loan
plt.figure(figsize=(8, 6))
sns.countplot(x='Family', hue='Personal_Loan', data=loan_df, palette='viridis')
plt.title('Countplot of Family vs. Personal Loan')
plt.xlabel('Family Size')
plt.ylabel('Count')
plt.legend(title='Personal Loan', labels=['No', 'Yes'])
plt.show()
# Calculate and print percentage distribution
cross_tab = pd.crosstab(loan_df['Family'], loan_df['Personal_Loan'], normalize='index') * 100
print('\nPercentage of Personal Loan Acceptance by Family:\n', cross_tab)
Percentage of Personal Loan Acceptance by Family: Personal_Loan 0 1 Family 1 92.73098 7.26902 2 91.82099 8.17901 3 86.83168 13.16832 4 89.03437 10.96563
# Family vs. Personal_Loan
plt.figure(figsize=(8, 6))
sns.countplot(x='Family', hue='Personal_Loan', data=loan_df, palette='viridis')
plt.title('Countplot of Family vs. Personal Loan')
plt.xlabel('Family Size')
plt.ylabel('Count')
plt.legend(title='Personal Loan', labels=['No', 'Yes'])
plt.show()
# Calculate and print percentage distribution
cross_tab = pd.crosstab(loan_df['Family'], loan_df['Personal_Loan'], normalize='index') * 100
print('\nPercentage of Personal Loan Acceptance by Family:\n', cross_tab)
Percentage of Personal Loan Acceptance by Family: Personal_Loan 0 1 Family 1 92.73098 7.26902 2 91.82099 8.17901 3 86.83168 13.16832 4 89.03437 10.96563
# Education vs. Personal_Loan
plt.figure(figsize=(8, 6))
sns.countplot(x='Education', hue='Personal_Loan', data=loan_df, palette='viridis')
plt.title('Countplot of Education vs. Personal Loan')
plt.xlabel('Education Level')
plt.ylabel('Count')
plt.legend(title='Personal Loan', labels=['No', 'Yes'])
plt.show()
# Calculate and print percentage distribution
cross_tab = pd.crosstab(loan_df['Education'], loan_df['Personal_Loan'], normalize='index') * 100
print('\nPercentage of Personal Loan Acceptance by Education:\n', cross_tab)
Percentage of Personal Loan Acceptance by Education: Personal_Loan 0 1 Education 1 95.56298 4.43702 2 87.02780 12.97220 3 86.34244 13.65756
# Securities_Account vs. Personal_Loan
plt.figure(figsize=(8, 6))
sns.countplot(x='Securities_Account', hue='Personal_Loan', data=loan_df, palette='viridis')
plt.title('Countplot of Securities_Account vs. Personal Loan')
plt.xlabel('Securities_Account (0: No, 1: Yes)')
plt.ylabel('Count')
plt.legend(title='Personal Loan', labels=['No', 'Yes'])
plt.show()
# Calculate and print percentage distribution
cross_tab = pd.crosstab(loan_df['Securities_Account'], loan_df['Personal_Loan'], normalize='index') * 100
print('\nPercentage of Personal Loan Acceptance by Securities_Account:\n', cross_tab)
Percentage of Personal Loan Acceptance by Securities_Account: Personal_Loan 0 1 Securities_Account 0 90.62081 9.37919 1 88.50575 11.49425
# CD_Account vs. Personal_Loan
plt.figure(figsize=(8, 6))
sns.countplot(x='CD_Account', hue='Personal_Loan', data=loan_df, palette='viridis')
plt.title('Countplot of CD_Account vs. Personal Loan')
plt.xlabel('CD_Account (0: No, 1: Yes)')
plt.ylabel('Count')
plt.legend(title='Personal Loan', labels=['No', 'Yes'])
plt.show()
# Calculate and print percentage distribution
cross_tab = pd.crosstab(loan_df['CD_Account'], loan_df['Personal_Loan'], normalize='index') * 100
print('\nPercentage of Personal Loan Acceptance by CD_Account:\n', cross_tab)
Percentage of Personal Loan Acceptance by CD_Account: Personal_Loan 0 1 CD_Account 0 92.76288 7.23712 1 53.64238 46.35762
# Online vs. Personal_Loan
plt.figure(figsize=(8, 6))
sns.countplot(x='Online', hue='Personal_Loan', data=loan_df, palette='viridis')
plt.title('Countplot of Online vs. Personal Loan')
plt.xlabel('Online (0: No, 1: Yes)')
plt.ylabel('Count')
plt.legend(title='Personal Loan', labels=['No', 'Yes'])
plt.show()
# Calculate and print percentage distribution
cross_tab = pd.crosstab(loan_df['Online'], loan_df['Personal_Loan'], normalize='index') * 100
print('\nPercentage of Personal Loan Acceptance by Online:\n', cross_tab)
Percentage of Personal Loan Acceptance by Online: Personal_Loan 0 1 Online 0 90.62500 9.37500 1 90.24799 9.75201
# CreditCard vs. Personal_Loan
plt.figure(figsize=(8, 6))
sns.countplot(x='CreditCard', hue='Personal_Loan', data=loan_df, palette='viridis')
plt.title('Countplot of CreditCard vs. Personal Loan')
plt.xlabel('CreditCard (0: No, 1: Yes)')
plt.ylabel('Count')
plt.legend(title='Personal Loan', labels=['No', 'Yes'])
plt.show()
# Calculate and print percentage distribution
cross_tab = pd.crosstab(loan_df['CreditCard'], loan_df['Personal_Loan'], normalize='index') * 100
print('\nPercentage of Personal Loan Acceptance by CreditCard:\n', cross_tab)
Percentage of Personal Loan Acceptance by CreditCard: Personal_Loan 0 1 CreditCard 0 90.45326 9.54674 1 90.27211 9.72789
# Copying the data to avoid touching original DF
df = loan_df.copy()
#preview it
df.sample(10)
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 441 | 442 | 52 | 27 | 43 | 93555 | 1 | 1.30000 | 2 | 0 | 0 | 1 | 0 | 1 | 0 |
| 4470 | 4471 | 44 | 20 | 111 | 91911 | 2 | 5.30000 | 2 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1790 | 1791 | 44 | 20 | 43 | 92124 | 1 | 0.30000 | 3 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3659 | 3660 | 33 | 7 | 22 | 94002 | 1 | 0.40000 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1429 | 1430 | 31 | 5 | 35 | 95064 | 1 | 0.60000 | 3 | 171 | 0 | 0 | 0 | 0 | 0 |
| 3967 | 3968 | 40 | 15 | 22 | 94306 | 1 | 0.60000 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4656 | 4657 | 47 | 21 | 38 | 91101 | 3 | 0.60000 | 2 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1603 | 1604 | 36 | 6 | 138 | 92152 | 1 | 7.00000 | 3 | 86 | 1 | 0 | 0 | 1 | 0 |
| 357 | 358 | 38 | 14 | 42 | 94610 | 1 | 2.00000 | 2 | 81 | 0 | 0 | 0 | 1 | 0 |
| 334 | 335 | 48 | 23 | 45 | 95053 | 1 | 1.30000 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
Dropping Zicode column as it has very low correlation with the dependent variable
# Dropping Zicode column as it has very low correlation with the dependent variable
df.drop("ZIPCode",axis=1,inplace=True)
# Checking how many negative values on the experience column
print("We have %2d negative values on the experience column" % df[df['Experience'] < 0].shape[0])
We have 52 negative values on the experience column
#Convert all to Zeros
df['Experience'] = df['Experience'].apply(lambda x : 0 if x < 0 else x)
#Input with median
df['Experience'] = df['Experience'].replace(0,df['Experience'].median())
# Checking how many negative values on the experience column after treatment
print("We now have %2d negative values on the experience column after treatment" % df[df['Experience'] < 0].shape[0])
We now have 0 negative values on the experience column after treatment
df.head()
| ID | Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 4 | 1.60000 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 3 | 1.50000 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 1 | 1.00000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 1 | 2.70000 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 4 | 1.00000 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
X = df.drop(["Personal_Loan"], axis=1)
y = df["Personal_Loan"]
# Creating dummy variables for categorical features
X = pd.get_dummies(X, drop_first=True)
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.30, random_state=1
)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("\n Percentage of classes in training set : ")
print(y_train.value_counts(normalize=True))
print("\n Percentage of classes in test set : ")
print(y_test.value_counts(normalize=True))
Shape of Training set : (3500, 12) Shape of test set : (1500, 12) Percentage of classes in training set : 0 0.90543 1 0.09457 Name: Personal_Loan, dtype: float64 Percentage of classes in test set : 0 0.90067 1 0.09933 Name: Personal_Loan, dtype: float64
When evaluating the model’s performance, we consider two types of incorrect predictions:
The model predicts that a customer will take a personal loan, but the customer actually does not.
The model predicts that a customer will not take a personal loan, but the customer actually does.
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification(model, predictors, target):
"""
Function to compute different metrics to check classification model performance
model: classifier
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
acc = accuracy_score(target, pred) # to compute Accuracy
recall = recall_score(target, pred) # to compute Recall
precision = precision_score(target, pred) # to compute Precision
f1 = f1_score(target, pred) # to compute F1-score
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
index=[0],
)
return df_perf
# defining a function to plot the confusion_matrix of a classification model
def confusion_matrix_sklearn(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("True label")
plt.xlabel("Predicted label")
# Building a Decision Tree using default 'gini' criteria to split
model = DecisionTreeClassifier(
criterion="gini", class_weight={0: 0.15, 1: 0.85}, random_state=1
) # defining higher weightage to 1 in the dependent variable
# Fitting the model on the training set
model.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=1)confusion_matrix_sklearn(model, X_train, y_train)
decision_tree_perf_train = model_performance_classification(model, X_train, y_train)
decision_tree_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.00000 | 1.00000 | 1.00000 | 1.00000 |
confusion_matrix_sklearn(model, X_test, y_test)
decision_tree_perf_test = model_performance_classification(model, X_test, y_test)
decision_tree_perf_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.98067 | 0.89262 | 0.91096 | 0.90169 |
# Creating a list of column names
feature_names = X_train.columns.to_list()
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
model,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree
print(tree.export_text(model, feature_names=feature_names, show_weights=True))
|--- Income <= 98.50 | |--- CCAvg <= 2.95 | | |--- weights: [374.10, 0.00] class: 0 | |--- CCAvg > 2.95 | | |--- CD_Account <= 0.50 | | | |--- CCAvg <= 3.95 | | | | |--- Income <= 81.50 | | | | | |--- Experience <= 12.50 | | | | | | |--- Family <= 3.50 | | | | | | | |--- CCAvg <= 3.50 | | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | | | |--- CCAvg > 3.50 | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- Family > 3.50 | | | | | | | |--- Income <= 66.50 | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | |--- Income > 66.50 | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | |--- Experience > 12.50 | | | | | | |--- ID <= 1182.00 | | | | | | | |--- ID <= 1020.50 | | | | | | | | |--- weights: [1.35, 0.00] class: 0 | | | | | | | |--- ID > 1020.50 | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | |--- ID > 1182.00 | | | | | | | |--- Experience <= 13.50 | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | |--- Experience > 13.50 | | | | | | | | |--- weights: [5.10, 0.00] class: 0 | | | | |--- Income > 81.50 | | | | | |--- ID <= 934.50 | | | | | | |--- weights: [1.35, 0.00] class: 0 | | | | | |--- ID > 934.50 | | | | | | |--- CCAvg <= 3.05 | | | | | | | |--- weights: [0.60, 0.00] class: 0 | | | | | | |--- CCAvg > 3.05 | | | | | | | |--- Mortgage <= 162.00 | | | | | | | | |--- Securities_Account <= 0.50 | | | | | | | | | |--- ID <= 3334.00 | | | | | | | | | | |--- ID <= 1748.00 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- ID > 1748.00 | | | | | | | | | | | |--- weights: [1.05, 0.00] class: 0 | | | | | | | | | |--- ID > 3334.00 | | | | | | | | | | |--- weights: [0.00, 5.95] class: 1 | | | | | | | | |--- Securities_Account > 0.50 | | | | | | | | | |--- Income <= 84.00 | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | | |--- Income > 84.00 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | |--- Mortgage > 162.00 | | | | | | | | |--- Experience <= 9.50 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | |--- Experience > 9.50 | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | |--- CCAvg > 3.95 | | | | |--- weights: [6.75, 0.00] class: 0 | | |--- CD_Account > 0.50 | | | |--- ID <= 766.50 | | | | |--- weights: [0.15, 0.00] class: 0 | | | |--- ID > 766.50 | | | | |--- weights: [0.00, 6.80] class: 1 |--- Income > 98.50 | |--- Education <= 1.50 | | |--- Family <= 2.50 | | | |--- Income <= 100.00 | | | | |--- CCAvg <= 4.20 | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | |--- CCAvg > 4.20 | | | | | |--- Experience <= 29.50 | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | |--- Experience > 29.50 | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | |--- Income > 100.00 | | | | |--- Income <= 103.50 | | | | | |--- Securities_Account <= 0.50 | | | | | | |--- weights: [2.10, 0.00] class: 0 | | | | | |--- Securities_Account > 0.50 | | | | | | |--- CCAvg <= 3.06 | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- CCAvg > 3.06 | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | |--- Income > 103.50 | | | | | |--- Securities_Account <= 0.50 | | | | | | |--- weights: [57.90, 0.00] class: 0 | | | | | |--- Securities_Account > 0.50 | | | | | | |--- weights: [7.05, 0.00] class: 0 | | |--- Family > 2.50 | | | |--- Income <= 108.50 | | | | |--- Experience <= 3.50 | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | |--- Experience > 3.50 | | | | | |--- weights: [1.20, 0.00] class: 0 | | | |--- Income > 108.50 | | | | |--- Age <= 26.00 | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | |--- Age > 26.00 | | | | | |--- Income <= 113.50 | | | | | | |--- ID <= 2746.00 | | | | | | | |--- weights: [0.00, 3.40] class: 1 | | | | | | |--- ID > 2746.00 | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | |--- Income > 113.50 | | | | | | |--- weights: [0.00, 41.65] class: 1 | |--- Education > 1.50 | | |--- Income <= 116.50 | | | |--- CCAvg <= 2.80 | | | | |--- Income <= 106.50 | | | | | |--- weights: [5.40, 0.00] class: 0 | | | | |--- Income > 106.50 | | | | | |--- Experience <= 31.50 | | | | | | |--- Age <= 41.50 | | | | | | | |--- Mortgage <= 51.50 | | | | | | | | |--- CCAvg <= 1.55 | | | | | | | | | |--- weights: [1.05, 0.00] class: 0 | | | | | | | | |--- CCAvg > 1.55 | | | | | | | | | |--- Experience <= 3.50 | | | | | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | | | | | | | |--- Experience > 3.50 | | | | | | | | | | |--- Experience <= 15.00 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- Experience > 15.00 | | | | | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | | | |--- Mortgage > 51.50 | | | | | | | | |--- Securities_Account <= 0.50 | | | | | | | | | |--- weights: [1.35, 0.00] class: 0 | | | | | | | | |--- Securities_Account > 0.50 | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | |--- Age > 41.50 | | | | | | | |--- Online <= 0.50 | | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | | | |--- Online > 0.50 | | | | | | | | |--- ID <= 4216.50 | | | | | | | | | |--- Experience <= 24.00 | | | | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | | | | | |--- Experience > 24.00 | | | | | | | | | | |--- Age <= 52.50 | | | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | | | | |--- Age > 52.50 | | | | | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | | | | |--- ID > 4216.50 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | |--- Experience > 31.50 | | | | | | |--- weights: [1.50, 0.00] class: 0 | | | |--- CCAvg > 2.80 | | | | |--- Age <= 63.50 | | | | | |--- ID <= 4852.50 | | | | | | |--- ID <= 3296.00 | | | | | | | |--- Income <= 114.50 | | | | | | | | |--- Experience <= 9.00 | | | | | | | | | |--- Age <= 32.00 | | | | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | | | | | |--- Age > 32.00 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | |--- Experience > 9.00 | | | | | | | | | |--- weights: [0.00, 12.75] class: 1 | | | | | | | |--- Income > 114.50 | | | | | | | | |--- ID <= 1466.50 | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | |--- ID > 1466.50 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- ID > 3296.00 | | | | | | | |--- Income <= 100.00 | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | | |--- Income > 100.00 | | | | | | | | |--- CD_Account <= 0.50 | | | | | | | | | |--- CCAvg <= 3.50 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | |--- CCAvg > 3.50 | | | | | | | | | | |--- weights: [0.00, 4.25] class: 1 | | | | | | | | |--- CD_Account > 0.50 | | | | | | | | | |--- Experience <= 15.00 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | |--- Experience > 15.00 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | |--- ID > 4852.50 | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | |--- Age > 63.50 | | | | | |--- weights: [0.30, 0.00] class: 0 | | |--- Income > 116.50 | | | |--- Age <= 26.50 | | | | |--- weights: [0.00, 7.65] class: 1 | | | |--- Age > 26.50 | | | | |--- weights: [0.00, 181.05] class: 1
# Gini importance of features in the tree building
print(
pd.DataFrame(
model.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
Imp Income 0.59064 Family 0.14140 Education 0.12326 CCAvg 0.08345 ID 0.01900 Experience 0.01291 CD_Account 0.01248 Age 0.00695 Securities_Account 0.00458 Mortgage 0.00353 Online 0.00178 CreditCard 0.00000
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 10))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# Choose the type of classifier
estimator = DecisionTreeClassifier(random_state=1, class_weight={0: 0.15, 0: 0.85})
# Grid of parameters to choose from
parameters = {
"max_depth": [5, 10, 15, None],
"criterion": ["entropy", "gini"],
"splitter": ["best", "random"],
"min_impurity_decrease": [0.00001, 0.0001, 0.01],
}
# Type of scoring used to compare parameter combinations
scorer = make_scorer(recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data
estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.85}, max_depth=5,
min_impurity_decrease=1e-05, random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. DecisionTreeClassifier(class_weight={0: 0.85}, max_depth=5,
min_impurity_decrease=1e-05, random_state=1)decision_tree_tune_perf_train = model_performance_classification(
estimator, X_train, y_train
)
decision_tree_tune_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.99200 | 0.95166 | 0.96330 | 0.95745 |
confusion_matrix_sklearn(estimator, X_train, y_train)
decision_tree_tune_perf_test = model_performance_classification(
estimator, X_test, y_test
)
decision_tree_tune_perf_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.98067 | 0.89262 | 0.91096 | 0.90169 |
plt.figure(figsize=(15, 10))
out = tree.plot_tree(
estimator,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Test report showing the rules of a decision tree
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- Income <= 113.50 | |--- CCAvg <= 2.95 | | |--- Income <= 106.50 | | | |--- weights: [2170.05, 0.00] class: 0 | | |--- Income > 106.50 | | | |--- Family <= 3.50 | | | | |--- ID <= 306.50 | | | | | |--- weights: [1.70, 1.00] class: 0 | | | | |--- ID > 306.50 | | | | | |--- weights: [34.85, 1.00] class: 0 | | | |--- Family > 3.50 | | | | |--- Age <= 35.00 | | | | | |--- weights: [9.35, 1.00] class: 0 | | | | |--- Age > 35.00 | | | | | |--- weights: [1.70, 5.00] class: 1 | |--- CCAvg > 2.95 | | |--- Income <= 92.50 | | | |--- CD_Account <= 0.50 | | | | |--- Age <= 26.50 | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | |--- Age > 26.50 | | | | | |--- weights: [99.45, 9.00] class: 0 | | | |--- CD_Account > 0.50 | | | | |--- weights: [0.00, 5.00] class: 1 | | |--- Income > 92.50 | | | |--- Education <= 1.50 | | | | |--- CD_Account <= 0.50 | | | | | |--- weights: [23.80, 3.00] class: 0 | | | | |--- CD_Account > 0.50 | | | | | |--- weights: [0.85, 4.00] class: 1 | | | |--- Education > 1.50 | | | | |--- ID <= 3296.00 | | | | | |--- weights: [0.85, 16.00] class: 1 | | | | |--- ID > 3296.00 | | | | | |--- weights: [6.80, 8.00] class: 1 |--- Income > 113.50 | |--- Education <= 1.50 | | |--- Family <= 2.50 | | | |--- weights: [331.50, 0.00] class: 0 | | |--- Family > 2.50 | | | |--- weights: [0.00, 49.00] class: 1 | |--- Education > 1.50 | | |--- Income <= 116.50 | | | |--- CCAvg <= 2.80 | | | | |--- CCAvg <= 2.25 | | | | | |--- weights: [8.50, 0.00] class: 0 | | | | |--- CCAvg > 2.25 | | | | | |--- weights: [2.55, 1.00] class: 0 | | | |--- CCAvg > 2.80 | | | | |--- Experience <= 35.50 | | | | | |--- weights: [0.00, 5.00] class: 1 | | | | |--- Experience > 35.50 | | | | | |--- weights: [1.70, 0.00] class: 0 | | |--- Income > 116.50 | | | |--- weights: [0.00, 222.00] class: 1
# Gini importance of features in the tree building
print(
pd.DataFrame(
estimator.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
Imp Education 0.37673 Income 0.37514 Family 0.15958 CCAvg 0.04673 CD_Account 0.02201 Age 0.00930 ID 0.00588 Experience 0.00463 Mortgage 0.00000 Securities_Account 0.00000 Online 0.00000 CreditCard 0.00000
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
clf = DecisionTreeClassifier(random_state=1, class_weight={0: 0.15, 1: 0.85})
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
| ccp_alphas | impurities | |
|---|---|---|
| 0 | 0.00000 | -0.00000 |
| 1 | 0.00000 | -0.00000 |
| 2 | 0.00000 | -0.00000 |
| 3 | 0.00000 | -0.00000 |
| 4 | 0.00000 | -0.00000 |
| 5 | 0.00000 | -0.00000 |
| 6 | 0.00000 | -0.00000 |
| 7 | 0.00000 | -0.00000 |
| 8 | 0.00000 | -0.00000 |
| 9 | 0.00000 | -0.00000 |
| 10 | 0.00019 | 0.00078 |
| 11 | 0.00034 | 0.00111 |
| 12 | 0.00035 | 0.00217 |
| 13 | 0.00036 | 0.00253 |
| 14 | 0.00037 | 0.00290 |
| 15 | 0.00038 | 0.00328 |
| 16 | 0.00038 | 0.00367 |
| 17 | 0.00039 | 0.00405 |
| 18 | 0.00039 | 0.00523 |
| 19 | 0.00046 | 0.00706 |
| 20 | 0.00067 | 0.00773 |
| 21 | 0.00069 | 0.00842 |
| 22 | 0.00083 | 0.00925 |
| 23 | 0.00091 | 0.01016 |
| 24 | 0.00094 | 0.01110 |
| 25 | 0.00094 | 0.01299 |
| 26 | 0.00095 | 0.01584 |
| 27 | 0.00099 | 0.02079 |
| 28 | 0.00100 | 0.02278 |
| 29 | 0.00101 | 0.02379 |
| 30 | 0.00108 | 0.02596 |
| 31 | 0.00132 | 0.02727 |
| 32 | 0.00164 | 0.02891 |
| 33 | 0.00172 | 0.03063 |
| 34 | 0.00184 | 0.03247 |
| 35 | 0.00229 | 0.03476 |
| 36 | 0.00274 | 0.03750 |
| 37 | 0.00334 | 0.04084 |
| 38 | 0.00353 | 0.04436 |
| 39 | 0.00514 | 0.04950 |
| 40 | 0.00901 | 0.05851 |
| 41 | 0.01005 | 0.06857 |
| 42 | 0.02253 | 0.09110 |
| 43 | 0.06112 | 0.21334 |
| 44 | 0.25380 | 0.46714 |
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("Effective alpha")
ax.set_ylabel("Total impurity of leaves")
ax.set_title("Total impurity vs Effective alpha for training set")
plt.show()
# Training the decision tree using the effective alphas
clfs = []
for ccp_alpha in ccp_alphas:
clf = DecisionTreeClassifier(
random_state=1, ccp_alpha=ccp_alpha, class_weight={0: 0.15, 1: 0.85}
)
clf.fit(X_train, y_train)
clfs.append(clf)
print(
"Number of nodes in the last tree is : {} with ccp_alpha : {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]
)
)
Number of nodes in the last tree is : 1 with ccp_alpha : 0.2537957148948088
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("Number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("Depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
recall_train = []
for clf in clfs:
pred_train = clf.predict(X_train)
values_train = recall_score(y_train, pred_train)
recall_train.append(values_train)
recall_test = []
for clf in clfs:
pred_test = clf.predict(X_test)
values_test = recall_score(y_test, pred_test)
recall_test.append(values_test)
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
# Creating the model where we get highest train and test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.009008434301508094,
class_weight={0: 0.15, 1: 0.85}, random_state=1)
best_model.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.009008434301508094,
class_weight={0: 0.15, 1: 0.85}, random_state=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. DecisionTreeClassifier(ccp_alpha=0.009008434301508094,
class_weight={0: 0.15, 1: 0.85}, random_state=1)decision_tree_postpruned_perf_train = model_performance_classification(
best_model, X_train, y_train
)
decision_tree_postpruned_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.93514 | 0.99094 | 0.59420 | 0.74292 |
confusion_matrix_sklearn(best_model, X_train, y_train)
decision_tree_postpruned_perf_test = model_performance_classification(
best_model, X_test, y_test
)
decision_tree_postpruned_perf_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.93933 | 0.98658 | 0.62288 | 0.76364 |
confusion_matrix_sklearn(best_model, X_test, y_test)
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
best_model,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=True,
class_names=['No Loan', 'Loan'],
)
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
plt.show()
# Text report showing the rules of a decision tree
print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
|--- Income <= 98.50 | |--- CCAvg <= 2.95 | | |--- weights: [374.10, 0.00] class: 0 | |--- CCAvg > 2.95 | | |--- weights: [18.60, 18.70] class: 1 |--- Income > 98.50 | |--- Education <= 1.50 | | |--- Family <= 2.50 | | | |--- weights: [67.65, 2.55] class: 0 | | |--- Family > 2.50 | | | |--- weights: [1.65, 45.90] class: 1 | |--- Education > 1.50 | | |--- Income <= 116.50 | | | |--- weights: [13.35, 25.50] class: 1 | | |--- Income > 116.50 | | | |--- weights: [0.00, 188.70] class: 1
# Gini importance of features in the tree building
print(
pd.DataFrame(
best_model.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
Imp Income 0.64570 Family 0.15824 Education 0.14091 CCAvg 0.05514 ID 0.00000 Age 0.00000 Experience 0.00000 Mortgage 0.00000 Securities_Account 0.00000 CD_Account 0.00000 Online 0.00000 CreditCard 0.00000
importances = best_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# Training performance comparison
models_train_comp_df = pd.concat(
[
decision_tree_perf_train.T,
decision_tree_tune_perf_train.T,
decision_tree_postpruned_perf_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree Sklearn",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Training performance comparison : ")
models_train_comp_df
Training performance comparison :
| Decision Tree Sklearn | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) | |
|---|---|---|---|
| Accuracy | 1.00000 | 0.99200 | 0.93514 |
| Recall | 1.00000 | 0.95166 | 0.99094 |
| Precision | 1.00000 | 0.96330 | 0.59420 |
| F1 | 1.00000 | 0.95745 | 0.74292 |
# Test performance comparison
models_test_comp_df = pd.concat(
[
decision_tree_perf_test.T,
decision_tree_tune_perf_test.T,
decision_tree_postpruned_perf_test.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Decision Tree Sklearn",
"Decision Tree (Pre-Pruning)",
"Decision Tree (Post-Pruning)",
]
print("Test performance comparison : ")
models_test_comp_df
Test performance comparison :
| Decision Tree Sklearn | Decision Tree (Pre-Pruning) | Decision Tree (Post-Pruning) | |
|---|---|---|---|
| Accuracy | 0.98067 | 0.98067 | 0.93933 |
| Recall | 0.89262 | 0.89262 | 0.98658 |
| Precision | 0.91096 | 0.91096 | 0.62288 |
| F1 | 0.90169 | 0.90169 | 0.76364 |
To enhance the success rate of personal loan campaigns, the marketing team should focus on customer segments that exhibit the following characteristics:
Annual income above $120,000
Customers with higher income levels typically have better repayment capacity and financial reliability.
Education level: Graduate or Advanced/Professional
Higher educational attainment often correlates with financial awareness and a stronger likelihood of loan acceptance.
Family size of three or more
Larger households may have greater financial needs, making them more inclined to consider personal loans.
Ownership of a Certificate of Deposit (CD) account
These customers show strong banking relationships and lower credit risk due to available collateral.
Average credit card spending above $2,500 per month
High spenders may have short-term liquidity needs that personal loans can fulfill.
Active mortgage account
Individuals with home loans may seek additional funds for home improvement, debt consolidation, or other needs.
Age above 35 years
Mid-aged customers are often financially active and managing various responsibilities.
More than nine years of professional experience
Career stability often reflects steady income and greater eligibility for loans.
Residence in key counties: Los Angeles, Santa Clara, San Diego, and Riverside
These regions have shown favorable engagement with personal loan products based on model outcomes.
Targeting these segments is likely to improve loan campaign conversion rates while managing risk effectively.